Abstract

Trace-driven cache simulation is central to computer design. A trace is a very long sequence, $x_1, \ldots, x_N$,
of references to lines (contiguous locations) from main memory. At the $t$-th instant, reference $x_t$ is
hashed into a set of cache locations, the contents of which are then compared with $x_t$. If at the $t$-th instant
$x_t$ is not present in the cache, then it is said to be a miss and is loaded into the cache set, possibly
forcing the replacement of some other memory line, and making $x_t$ present for the $(t+1)$-st instant. The
problem of parallel simulation of a subtrace of $N$ references directed to a $C$ line cache set is considered,
with the aim of determining which references are misses and related statistics.

A simulation method is presented for the Least-Recently-Used (LRU) policy, which regardless of
the set size $C$ runs in time $O(\log N)$ using $N$ processors on the exclusive read, exclusive write (EREW)
parallel model. A simpler LRU simulation algorithm is given that runs in $O(C \log N)$ time using $N/\log N$
processors. We present timings of the second algorithm's implementation on the MasPar MP-1, a machine
with 16384 processors. A broad class of reference-based line replacement policies is considered, which
includes LRU as well as the Least-Frequently-Used and Random replacement policies. A simulation
method is presented for any such policy that on any trace of length $N$ directed to a $C$ line set runs in
$O(C \log N)$ time with high probability using $N$ processors on the EREW model. The algorithms
are simple, have very little space overhead, and are well-suited for SIMD implementation.


*This research was supported in part by NASA grants NAG-1-1132 and NAS-1-18605, in part by NSF Grant ASC 8819373,
and was initiated during a visit to AT&T Bell Laboratories.
1  Introduction


A  cache  is  a  high-speed  memory  on  the  access  path  to  a  larger,  slower  main  memory.  Cache  performance  is 
critical  to  the  overall  performance  of  computer  systems  [10],  and  consequently  a  tremendous  amount  of  effort 
is  put  into  the  evaluation  of  cache  designs.  This  is  particularly  true  for  RISC  microprocessor  designs,  where 
the  ratio  of  the  time  needed  to  access  an  off-chip  cache  to  that  needed  to  access  the  main  memory  can  be  as 
high  as  10  [10],  and  the  off-chip  cache  is  typically  at  least  10  times  smaller  than  the  main  memory.  Trace- 
driven  simulations,  which  evaluate  cache  performance  on  actual  reference  streams  taken  from  characteristic 
programs,  are  the  most  reliable  and  widely  used  tools  for  cache  design  evaluation.  These  simulations  require  a 
great  deal  of  computation,  because  of  the  many  different  design  possibilities  that  are  simulated,  and  because 
of the length of the reference traces that drive the simulation [10].

Data  is  moved  between  main  memory  and  the  cache  in  contiguous  blocks  called  lines.  Every  memory  line 
is  hashed  to  some  fixed  cache  set,  but  may  be  placed  in  any  one  of  the  C  physical  cache  lines  in  the  set.  In 
emerging  computer  designs,  a  microprocessor  might  be  supported  by  a  1  Megabyte  off-chip  cache,  with  a  line 
size  of  128  bytes,  and  a  set  size  C  =  4.  A  miss  occurs  whenever  a  memory  line  is  referenced,  but  is  not  found 
in  its  set.  The  cache  hardware  then  fetches  the  desired  line  from  main  memory,  overwriting  another  line  in 
the  same  set  if  the  set  is  full.  The  rule  used  to  select  which  line  to  replace  is  called  the  replacement  policy. 

An  effective,  widely  used  policy  is  Least-Recently-Used  (LRU),  which  simply  replaces  the  line  accessed  least 
recently.  The  objective  of  a  trace-driven  simulation  is  to  determine  which  references  in  the  trace  are  misses. 
Given  the  identities  of  the  misses,  statistics  of  chief  interest  in  cache  design  are  easily  computed,  such  as  the 
fraction  of  read  misses,  the  fraction  of  write  misses,  and  the  number  of  write-backs  (stores  of  modified  lines) 
from  cache  to  main  memory. 
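
To make the cache geometry above concrete, the following sketch works out the number of sets for the 1 Megabyte, 128-byte line, $C = 4$ configuration and maps a byte address to its memory line and set. The modulo hash and the names `line_and_set` and `NUM_SETS` are assumptions made for illustration; the text says only that every memory line is hashed to some fixed set.

```python
# Hypothetical geometry taken from the example in the text.
CACHE_BYTES = 1 << 20      # 1 Megabyte off-chip cache
LINE_BYTES = 128           # line size
SET_SIZE = 4               # C, the number of lines per set

# Number of sets = total lines in the cache / lines per set.
NUM_SETS = CACHE_BYTES // (LINE_BYTES * SET_SIZE)


def line_and_set(address):
    """Return (memory line number, set index) for a byte address.

    The modulo mapping is an assumed hash, not one prescribed by the paper.
    """
    line = address // LINE_BYTES
    return line, line % NUM_SETS
```

Under these assumptions the cache has 2048 sets, and a trace directed to one set is obtained by keeping only the references whose set index matches.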

Heidelberger and Stone [9] showed that it is valuable to simulate a long trace directed to a few sets, when
cache miss statistics between sets are highly correlated.¹ High correlation removes the need to simulate all
sets,  but  also  removes  the  easy  parallelism  that  might  be  exploited  by  simulating  a  large  number  of  sets  in 
parallel  on  different  processing  elements  (PEs).  A  massively  parallel  method  to  handle  the  simulation  of  a 
long  trace  targeted  to  a  single  set  allows  more  powerful,  flexible  solutions. 

We consider the problem of determining the misses in a given reference trace, $x_1, \ldots, x_N$, directed to
a set of size $C$. An algorithm is presented (Section 3) that solves this problem in $O(\log N)$ time using
$N/\log N$ PEs, on the exclusive-read, exclusive-write (EREW) model of a parallel machine. The algorithm
and its complexity do not depend on $C$. The algorithm computes the stack distance $\Delta_t$ associated with each
reference $x_t$ [16]. If $x_t$ is not a first reference to a line then $\Delta_t$ is the smallest set size for which $x_t$ would be
a hit; otherwise $\Delta_t = \infty$.


In Section 4, we present an alternative LRU simulation, with running time $O(C \log N)$ using $N/\log N$
PEs on the EREW model. The algorithm computes the stack distance at level $C$, $\Delta_t(C)$, for each reference
$x_t$. If $x_t$ is not a first reference to a line then $\Delta_t(C)$ is the smallest set size $\le C$ for which $x_t$ would be a hit;
otherwise $\Delta_t(C) = \infty$. The algorithm is simple and the implicit constant in the time bound is favorable.
We report timings of this algorithm's performance on a MasPar [4] SIMD computer having 16384 PEs.

¹ Recent experiments (private communication from Harold Stone) have validated that high correlation exists between sets,
but have also shown that special care must be taken when selecting the sets which are analyzed, as the measured miss ratio
from an arbitrary set simulation may not be an accurate predictor of the overall miss ratio.

In  Section  5,  a  broad  class  of  reference-based  replacement  policies  is  considered.  Roughly,  the  class 
contains  all  stack  replacement  policies  where  priorities  controlling  line  replacement  are  static  and  can  be 
computed  efficiently  in  parallel.  This  class  includes  LRU  as  well  as: 

•  OPT:  Replace  the  line  referenced  most  remotely  in  the  future.  This  unrealizable  policy  provably 
minimizes  the  number  of  misses.  Its  simulation  gives  a  baseline  against  which  realizable  policies  can 
be  measured. 

•  Least-Frequently-Used or LFU: Replace the line accessed least often in the past. Ties can be broken
by, for example, giving higher priority to the reference that has been in the cache the shortest length
of time.

•  Random: Replace one of the $C$ lines, chosen independently and uniformly at random. Random
replacement is easy to implement; furthermore, there is evidence that if the total number of lines in the
cache (not just the lines in one set) is sufficiently large, the policy works nearly as well as any other
implementable policy [10].

In Section 5, an algorithm is presented for reference-based policy simulation. Given any trace of $N$ references
targeted to a $C$ line set, the algorithm runs in $O(C \log N)$ time with high probability using $N$ PEs on
the EREW model. (The algorithm is probabilistic, the choice of trace is not.) In Section 5.3, we extend the
class of reference-based replacement policies to include an aging mechanism, whereby stale lines lose priority
and tend to be flushed from the cache. Accommodating this mechanism increases the algorithm's running
time to $O(C \log^2 N)$.

Our algorithms are simple, require at most $O(\log N)$ space per PE, and break the computation down
into calls to a few primitive parallel subroutines. As a result the algorithms are well-suited for SIMD
architectures, such as the Connection Machine [11] or MasPar [4]. The $O(C \log N)$ with high probability
bound holds because we have assumed that a fast probabilistic parallel algorithm [18] is used to solve a
certain trapezoidal decomposition problem (Sections 2, 5). Adopting the notation of [18], this algorithm
runs in $\tilde{O}(\log N)$ time using $N$ PEs, meaning that there is a constant $k$ such that the time exceeds $k m \log N$
with probability less than $N^{-m}$ for any $m > 1$. In practice, simpler, deterministic methods may do better,
while raising the asymptotic time bound to $O(C \log^2 N)$.

For simplicity, we have assumed the problem size $N$ is comparable to the number of PEs, so that it is as
if each PE handles a few references (up to $\log N$). However, a "supersaturated" setup [8] may be effective
in practice, where a large block of consecutive references would be loaded in the local memory of each PE.
Our algorithms generalize to that setup, by using efficient supersaturated implementations of the underlying
parallel primitives (cf. [12, 17]). Indeed, our implementation of the LRU algorithm is a supersaturated
one,  with  complexity  0{C{N / P  log^))  for  a  reference  trace  with  N  elements  on  an  architecture  with  P 
processors. 

Collecting the cache miss statistics mentioned above adds just $O(\log N)$ time. Moreover, by the nature
of  the  replacement  policies  and  the  simulation  methods,  statistics  for  each  set  size  up  to  C  can  be  computed 
at  this  cost.  All  of  our  algorithms  can  be  adapted  for  efficient  simultaneous  simulation  of  many  sets,  by  the 
simple  device  of  initially  sorting  the  references  on  the  basis  of  their  set  identifiers. 
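
The sorting device mentioned above can be sketched serially as follows: references are ordered by set identifier, with the trace index as a secondary key so that the order of references within each set is preserved. The function name `group_by_set` and the modulo set identifier are hypothetical; on a parallel machine the sort would be one of the parallel sorts of Section 2.2, and the groups would become segments for segmented scans.

```python
def group_by_set(lines, num_sets):
    """Group a trace of memory line numbers by set, preserving trace order.

    `lines[t] % num_sets` is an assumed set identifier; any fixed hash works.
    Returns {set_id: [(trace_index, line), ...]} with indices increasing.
    """
    # Sort trace positions by (set identifier, trace index).
    order = sorted(range(len(lines)), key=lambda t: (lines[t] % num_sets, t))
    groups = {}
    for t in order:
        groups.setdefault(lines[t] % num_sets, []).append((t, lines[t]))
    return groups
```

Each group is then an independent single-set trace of the kind considered in the rest of the paper.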

Heidelberger and Stone [9] had the original insight that trace-driven simulation of an LRU cache set
could be parallelized. Their algorithm is intended for a network of $P$ MIMD processors, and requires $P \ll N$
for good speedup. Our work was motivated by theirs; our algorithms are different, apply to a larger class
of replacement policies, and to a different class of architectures. Lin, Baer, and Lazowska have considered
parallelizing cache simulations, in the context of multiprocessor cache protocols [15]. Their method assumes
that each individual processor's cache is simulated on a different PE, so that the degree of parallelism is
limited to the number of caches in the simulated system. An important and beautiful paper on cache
simulation was published in 1970 by Mattson, Gecsei, Slutz, and Traiger [16]. Most of our notation is taken
from  that  paper. 

The  practical  utility  of  implementing  trace-driven  cache  simulations  on  today’s  SIMD  computers  has  yet 
to  be  shown,  although  our  implementation  proves  the  great  promise  of  the  approach.  It  seems  likely  that  a 
very  long  reference  trace  will  have  to  be  partitioned  into  blocks,  where  one  block  is  processed  at  a  time.  The 
I/O  problem  is  to  move  the  blocks  to  the  processors  fast  enough  to  keep  them  busy.  An  attractive  alternative 
is to use a synthetic trace; for example Thiebaut, Stone, and Wolf [20] recently proposed a simple method
for  random  generation  of  realistic  traces. 

2  Preliminaries 

2.1  Cache  Notation 

Henceforth, we focus on a single set cache, and treat its size $C$ as a parameter. Let $B_t(C)$ denote the set of
lines stored just after reference $x_t$. Each reference must be cached, so $x_t \in B_t(C)$ for all $C \ge 1$ and $t \ge 1$.
(By convention, $B_t(0)$ is the empty set.) If the cache is full ($|B_{t-1}(C)| = C$) and $x_t$ is a miss ($x_t \notin B_{t-1}(C)$)
then $x_t$ replaces a line in $B_{t-1}(C)$. We refer to this replaced line as $y_t$. All of the replacement policies we
consider are stack policies [16], meaning if a reference is a hit given that the cache size is $C$ then the reference
will remain a hit if the cache size is increased to $C + 1$. That is,

$$B_t(C) \subseteq B_t(C+1) \quad \text{for all } C \ge 0.$$

This inclusion allows us to order the lines of the cache by the least size needed for their appearance.
Figure 1: An example of the LRU rule acting on an $N = 18$ line trace, with lines labeled $a$ through $d$; $\emptyset$ is the
empty line marker.


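Define the $i$-th element of $B_t(C)$ as

$$s_t(i) = \begin{cases} B_t(i) - B_t(i-1) & \text{if } |B_t(i)| = i \\ \emptyset & \text{otherwise} \end{cases} \qquad (1)$$

The symbol $\emptyset$ is an empty line marker; $s_t(i) = \emptyset$ if fewer than $i$ lines belong to $B_t(i)$. We assume the stack
starts empty, and therefore let $s_0(i) = \emptyset$ for all $i > 0$.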

Figure 1 gives an example, for the LRU replacement policy. The trace length $N = 18$, and the lines are
labeled $a$ through $d$. The first level, $s_t(1)$, coincides with the trace itself. Consider the first two levels; i.e., the cache
contents given $C = 2$. The cache is initially empty. The first two references, $x_1 = a$, $x_2 = c$, miss, and as
a result $B_1(2) = \{a\}$, $B_2(2) = \{a, c\}$. The third reference, $x_3 = b$, also misses, forcing the replacement of
$y_3 = a$, yielding $B_3(2) = \{b, c\}$. The fourth reference, $x_4 = c$, hits, so $B_4(2) = B_3(2)$, and so forth.

Hit and miss statistics are easily extracted from stack distances, defined as follows. Let the level $i$ stack
distance $\Delta_t(i)$ denote the smallest cache size $\le i$ such that $x_t$ is a hit, or $\infty$ if $x_t$ is a miss for cache size
$i$. Thus, given $\Delta_t(C)$, for $t = 1, \ldots, N$, we can extract the hits for any cache size $c \le C$. More generally,
define the stack distance $\Delta_t = \lim_{C \to \infty} \Delta_t(C)$ to be the smallest cache size such that $x_t$ is a hit, or $\infty$ if
there is no prior reference to the same line (no $s < t$ with $x_s = x_t$). Stack distances are shown in Figure 1.
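
As a serial point of reference for the definitions above, the following sketch maintains an explicit LRU stack, records the stack distance of each reference, and then counts misses for every cache size up to a bound. The function names and the ten-symbol trace used in the usage note are hypothetical; this is the straightforward sequential procedure, not one of the parallel algorithms of this paper.

```python
import math


def lru_stack_distances(trace):
    """Return the LRU stack distance of each reference (math.inf on first use)."""
    stack = []       # stack[0] is level 1, the most recently used line
    distances = []
    for line in trace:
        if line in stack:
            level = stack.index(line) + 1   # smallest cache size giving a hit
            stack.remove(line)
        else:
            level = math.inf                # no prior reference to this line
        distances.append(level)
        stack.insert(0, line)               # the referenced line moves to level 1
    return distances


def miss_counts(distances, max_size):
    """Misses for cache sizes 1..max_size: a reference misses size c iff distance > c."""
    return [sum(1 for d in distances if d > c) for c in range(1, max_size + 1)]
```

For the hypothetical trace `"acbccacadb"` the distances are `[inf, inf, inf, 2, 1, 3, 2, 2, inf, 4]`, and the miss counts for cache sizes 1 through 4 are `[9, 6, 5, 4]`, which illustrates how one pass yields statistics for every set size.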


2.2  Parallel  Processing  Model 

Our  algorithms  are  well-suited  for  a  wide  variety  of  parallel  architectures,  because  the  algorithms  transform 
data  in  simple  ways  using  a  small  number  of  basic,  highly  parallelizable  operations.  However,  to  state 
precise  time  and  processor  requirements,  we  must  choose  a  precise  model  of  parallel  computation.  The 
EREW (exclusive read, exclusive write) model (cf. [14]) provides a nice blend of simplicity and realism.
In  this  model,  the  PEs  operate  in  lockstep.  There  is  a  global  shared  memory,  supporting  at  unit  cost  any 
pattern  of  accesses  except  those  where  two  PEs  simultaneously  access  the  same  location. 

We  state  the  complexity  of  our  algorithms  with  respect  to  the  EREW  model.  We  now  list  the  parallel 
subroutines used in our algorithms, and their complexities on the EREW model.

•  Merging: Two sorted lists each of length $N$ can be merged in time $O(\log N)$ using $N/\log N$ PEs, via
Batcher's odd-even merge algorithm [2].

•  Sorting: A list of length $N$ can be sorted in $O(\log N)$ time using $N$ PEs [5].

•  2d Ranking: Given points $(x_i, y_i)$, $i = 1, \ldots, N$, compute for each point its rank, the number
of other points $(x_j, y_j)$ strictly above and to the right: $x_i < x_j$ and $y_i < y_j$. In slightly different form,
this is the problem of computing the empirical cumulative distribution function (ECDF), considered
in [3]. The multidimensional divide-and-conquer serial algorithm given in [3] parallelizes easily to solve
the problem in $O(\log^2 N)$ time using $N$ PEs. Atallah et al. improved on this, lowering the time to
$O(\log N)$ using $N$ PEs.

•  Closest Larger Right Neighbor (CLRN) Problem: Given input numbers $a_1, \ldots, a_N$, find, for each $a_i$,
the index of the first larger number to the right; i.e., for each $i = 1, \ldots, N-1$, compute $b_i = \min\{j >
i : a_j > a_i\}$ if there is some $j > i$ with $a_j > a_i$, and $b_i = N + 1$ otherwise. The CLRN problem can be
reduced to trapezoidal decomposition [18]: given a set of line segments and points, from each point,
report the line segment first hit (if any) by a ray shot horizontally to the right. To make the reduction,
consider the polygonal path connecting consecutive points $(i, a_i)$, $i = 1, \ldots, N$. If $a_{i+1} > a_i$ then we
know $b_i = i + 1$. Otherwise, $b_i$ is the index $j$ of the right end point $(j, a_j)$ of the segment from $(j-1, a_{j-1})$
to $(j, a_j)$ first hit by the ray shot horizontally to the right from $(i, a_i)$, if there is an $a_j > a_i$, $j > i$.
If not then $b_i = N + 1$. Reif and Sen give a probabilistic algorithm for trapezoidal decomposition.
Applying that algorithm to the CLRN problem yields its solution in $\tilde{O}(\log N)$ time using $N$ PEs.
Alternatively, the CLRN problem can be solved by a binary search-like algorithm, given in Section 5,
in $O(\log^2 N)$ time using $N$ PEs.

•  Parallel Prefix (scan, segmented scan): Given inputs $a_1, \ldots, a_N$ and an associative operator $\circ$, compute
the partial products $p_1, \ldots, p_N$ where $p_i = a_1 \circ a_2 \circ \cdots \circ a_i$. Solutions to this parallel prefix problem [13]
are commonly called scan computations. The problem can be solved in $O(\log N)$ time using $N/\log N$
PEs [12].

A variation breaks the products over the indices $[1, N]$ into segments over these indices, with the
segment boundaries also given as inputs. For example, an additional vector $b_1, \ldots, b_N$ is given where
$b_1 = 0$, for $i > 1$, $b_i$ is either 0 or 1, and the 0's mark the segments' left boundaries. Specifically, if
$b_i = 0$ then $p_i = a_i$; otherwise, $p_i = a_j \circ a_{j+1} \circ \cdots \circ a_i$ where $j$ is the largest index $k$, $1 \le k < i$, such
that $b_k = 0$. The segmented problem has the same complexity as the original. In the algorithms below,
we use copy-scans defined by $\alpha \circ \beta = \alpha$, and add-scans where $\circ$ is addition.

None of the algorithms listed above requires more than $O(\log N)$ space per PE.
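
Since segmented copy-scans and add-scans recur throughout the algorithms below, a small sketch may help fix the convention that a 0 in the boundary vector starts a new segment. Two versions are given: a serial reference, and a lockstep recursive-doubling emulation in which every position updates in each of about $\log_2 N$ rounds. The function names are hypothetical, and the doubling version is an illustration of the idea only: it uses one virtual PE per element and does not attempt the $N/\log N$ processor bound or the exclusive-read discipline.

```python
from operator import add


def copy_op(left, right):
    """Copy-scan operator: alpha o beta = alpha."""
    return left


def segmented_scan_serial(a, b, op):
    """Reference segmented scan; b[i] == 0 marks the left boundary of a segment."""
    out = []
    for ai, bi in zip(a, b):
        out.append(ai if bi == 0 or not out else op(out[-1], ai))
    return out


def segmented_scan_doubling(a, b, op):
    """Emulate a lockstep recursive-doubling segmented scan (about log2 N rounds)."""
    n = len(a)
    # head[i] is True once position i has absorbed its segment's left boundary.
    head = [i == 0 or b[i] == 0 for i in range(n)]
    val = list(a)
    step = 1
    while step < n:
        new_head, new_val = head[:], val[:]
        for i in range(step, n):          # conceptually, all i act simultaneously
            if not head[i]:
                new_val[i] = op(val[i - step], val[i])
                new_head[i] = head[i - step]
        head, val = new_head, new_val
        step *= 2
    return val
```

With `a = [3, 1, 4, 1, 5, 9, 2, 6]` and `b = [0, 1, 1, 0, 1, 1, 1, 0]`, both versions give `[3, 4, 8, 1, 6, 15, 17, 6]` for the add-scan and `[3, 3, 3, 1, 1, 1, 1, 6]` for the copy-scan.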

3  Fast  Parallel  LRU  Simulation 

In  this  section  we  present  a  fast  parallel  algorithm  for  computing  stack  distances  under  the  LRU  replacement 
policy. 
LRU may be characterized as follows. Reference to $\alpha = x_t$ places $\alpha$ at the first level of the stack. Until
$\alpha$ is referenced again, it can only move down in the stack. Specifically, after $\alpha$ has been pushed to level $i$ it
remains there until a reference is made either to $\alpha$ (moving $\alpha$ to level 1) or to a line not stored in levels 1
through $i - 1$ (moving $\alpha$ to level $i + 1$). As a result, the stack distance $\Delta_t$ is one greater than the number of
distinct lines in the subtrace between $t$ and the closest prior reference to $\alpha$ (or $\infty$ if there is no prior reference
to $\alpha$). For example, in Figure 1, consider the consecutive references to line $b$ at $t = 3$ and $t = 10$. The stack
distance $\Delta_{10} = 4$ because 3 distinct symbols belong to the subtrace $x_4, \ldots, x_9$. More generally, letting


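$$\mathrm{prev}(t) = \begin{cases} \max\{s < t : x_s = x_t\} & \text{if } x_s = x_t \text{ for some } s < t \\ 0 & \text{otherwise} \end{cases}$$

we obtain

$$\Delta_t = \begin{cases} 1 + \text{number of distinct symbols in } x_{\mathrm{prev}(t)+1}, \ldots, x_{t-1} & \text{if } \mathrm{prev}(t) > 0 \\ \infty & \text{otherwise} \end{cases}$$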


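Let us take a geometric view of this new problem of counting distinct symbols within subtraces. As
illustrated in Figure 2, identify each reference $x_t$ with the point $(t, \mathrm{next}(t))$, where

$$\mathrm{next}(t) = \begin{cases} \min\{s > t : x_s = x_t\} & \text{if } x_s = x_t \text{ for some } s > t \\ N + 1 & \text{otherwise} \end{cases}$$

Note that the last references to symbols within the subtrace $x_{\mathrm{prev}(t)+1}, \ldots, x_{t-1}$ are identified by those points
$(s, \mathrm{next}(s))$ satisfying

$$\mathrm{prev}(t) < s < t < \mathrm{next}(s).$$

These are the points that lie strictly within the rectangle with lower left hand corner $(\mathrm{prev}(t), t)$, lower right
hand corner $(t, t)$ and sides extending upwards to $(\mathrm{prev}(t), N + 1)$ and $(t, N + 1)$. Again, see Figure 2.
Counting these points reduces to 2d-ranking. Specifically, suppose we know the 2d-rank, $\mathrm{rank}(u, v)$, of each
point $(u, v)$ in the union of sets $\{(t, \mathrm{next}(t)) : 1 \le t \le N\}$ and $\{(t, t) : 1 \le t \le N\}$. Then, the stack distance

$$\Delta_t = \begin{cases} \mathrm{rank}(\mathrm{prev}(t), t) - \mathrm{rank}(t, t) & \text{if } \mathrm{prev}(t) > 0 \\ \infty & \text{otherwise} \end{cases}$$

The difference counts the points $(s, \mathrm{next}(s))$ with $\mathrm{prev}(t) < s \le t$ and $\mathrm{next}(s) > t$; the point with $s = t$ accounts
for the leading 1 in the expression for $\Delta_t$. We see from Figure 2 that $\Delta_{10} = 4$ because $\mathrm{rank}(3, 10) = 20$ and
$\mathrm{rank}(10, 10) = 16$.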
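
The rank identity above can be checked on a small example with a brute-force sketch. It computes prev and next by a single pass, builds the point set $\{(t, \mathrm{next}(t))\} \cup \{(t, t)\}$, and evaluates ranks by direct counting in quadratic time; this is a correctness check under those simplifications, not the $O(\log N)$ parallel ranking algorithm. The function names and the ten-symbol trace in the usage note are hypothetical.

```python
import math


def prev_next(trace):
    """Return 1-indexed lists prev, nxt (index 0 unused) for a trace."""
    n = len(trace)
    prev = [0] * (n + 1)         # prev[t] = 0 when there is no earlier reference
    nxt = [n + 1] * (n + 1)      # nxt[t] = N + 1 when there is no later reference
    last_seen = {}
    for t in range(1, n + 1):
        line = trace[t - 1]
        if line in last_seen:
            prev[t] = last_seen[line]
            nxt[last_seen[line]] = t
        last_seen[line] = t
    return prev, nxt


def rank(points, u, v):
    """2d-rank of (u, v): number of points strictly above and to the right."""
    return sum(1 for (a, b) in points if a > u and b > v)


def stack_distances_by_rank(trace):
    """LRU stack distances via rank(prev(t), t) - rank(t, t)."""
    n = len(trace)
    prev, nxt = prev_next(trace)
    points = [(t, nxt[t]) for t in range(1, n + 1)]
    points += [(t, t) for t in range(1, n + 1)]
    return [
        rank(points, prev[t], t) - rank(points, t, t) if prev[t] > 0 else math.inf
        for t in range(1, n + 1)
    ]
```

On the hypothetical trace `"acbccacadb"` this returns `[inf, inf, inf, 2, 1, 3, 2, 2, inf, 4]`, the same distances a sequential LRU stack produces.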

Now, let us present the detailed simulation method. Suppose that the trace is initially stored in the
$N$-vector $x$. We use the additional $N$-vectors $p$, $\mathbf{next}$, $\mathbf{prev}$, and $\mathbf{\Delta}$. Initially, let $p_t = t$, so $x_t$ identifies
the line and $p_t$ the trace index of reference $x_t$. Vectors $\mathbf{next}$, $\mathbf{prev}$, and $\mathbf{\Delta}$ will hold permuted copies of the
values $\mathrm{next}(t)$, $\mathrm{prev}(t)$, and $\Delta_t$, respectively. The algorithm is as follows.


1.  [Compute next and prev.] Sort the tuples $(x_t, p_t)$ using $x_t$ as the primary key and $p_t$ as the secondary
key: $(x_t, p_t) < (x_s, p_s)$ if either $x_t < x_s$ or $x_t = x_s$ and $p_t < p_s$. Thus, the data now in location $t$ of $x$
and $p$ was in location $p_t$ before the sort. For all $t = 1, \ldots, N$, set

$$\mathbf{prev}_t = \begin{cases} p_{t-1} & \text{if } t > 1 \text{ and } x_{t-1} = x_t \\ 0 & \text{otherwise} \end{cases}$$

$$\mathbf{next}_t = \begin{cases} p_{t+1} & \text{if } t < N \text{ and } x_{t+1} = x_t \\ N + 1 & \text{otherwise} \end{cases}$$

At this point, the $\mathbf{prev}$ and $\mathbf{next}$ vectors hold permuted copies of the prev and next values discussed
above.

Figure 2: The trace of Figure 1 is repeated, along with corresponding next and prev values. The values
$(t, \mathrm{next}(t))$ are plotted as +'s, and the values $(t, t)$ as o's. The number of points strictly within the rectangle
indicated by dashed lines is one less than the stack distance of $x_{10} = b$.

2.  [2d-rank.] Compute the 2d-ranks of the set of points

$$\{(p_t, \mathbf{next}_t) : 1 \le t \le N\} \cup \{(t, t) : 1 \le t \le N\}$$

and set

$$\mathbf{\Delta}_t = \mathrm{rank}(\mathbf{prev}_t, p_t) - \mathrm{rank}(p_t, p_t).$$

As a result, $\mathbf{\Delta}_t = \Delta_{p_t}$, which completes the computation.

Sorting within the first step costs $O(\log N)$ time on $N$ PEs, using the EREW model [5]. The next and
prev computations may be done within the same time and processor bounds using segmented copy-scans,
with changes in the $x$ vector marking the segment boundaries. The 2d-ranking within the second step costs
$O(\log N)$ time on $N$ PEs [1]. Thus:

Theorem 1 On the EREW model, given the trace $x_1, \ldots, x_N$, the associated stack distances $\Delta_1, \ldots, \Delta_N$
induced under the LRU replacement policy can be computed in $O(\log N)$ time using $N$ PEs.


Aiming for a simpler implementation and smaller implicit constants, we may sacrifice a $\log N$ factor in
the running time. The natural parallelization of Bentley's multidimensional divide and conquer method [3]
gives a 2d-ranking algorithm that runs in $O(\log^2 N)$ time using $N$ PEs. Using, for example, Batcher's sorting
method [2] requires time $O(\log^2 N)$ on $N$ PEs.
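
Step 1 of the method above, computing prev and next by sorting, can be sketched serially as follows. After sorting trace indices by (line, index), references to the same line are adjacent and in trace order, so each entry reads its prev and next from its left and right neighbours. Python's built-in sort stands in for the parallel sort, and the function name is hypothetical.

```python
def prev_next_by_sorting(trace):
    """Compute prev(t) and next(t), t = 1..N, by sorting on (line, trace index)."""
    n = len(trace)
    # p holds trace indices ordered with the line as primary key, index as secondary.
    p = sorted(range(1, n + 1), key=lambda t: (trace[t - 1], t))
    prev = [0] * (n + 1)
    nxt = [n + 1] * (n + 1)
    for k, t in enumerate(p):
        # Left neighbour in sorted order is the closest earlier reference, if same line.
        if k > 0 and trace[p[k - 1] - 1] == trace[t - 1]:
            prev[t] = p[k - 1]
        # Right neighbour in sorted order is the closest later reference, if same line.
        if k + 1 < n and trace[p[k + 1] - 1] == trace[t - 1]:
            nxt[t] = p[k + 1]
    return prev[1:], nxt[1:]
```

The neighbour comparisons are independent across positions, which is why they parallelize directly once the sort is done; the sketch writes results back in trace order rather than keeping the permuted vectors.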

4  Parallel  Simulation  of  LRU  Level  by  Level 


An alternative approach is to simulate LRU level by level, at the $i$-th iteration computing the level $i$ cache
contents $s_1(i), \ldots, s_N(i)$ and stack distances $\Delta_1(i), \ldots, \Delta_N(i)$. Assuming a set size of $C$, the final results
are the stack distances $\Delta_1(C), \ldots, \Delta_N(C)$.

Define reference $x_t$ to be a prior hit (prior miss) at level $i$ if $x_t$ is a hit (miss) given that the cache size is
$i - 1$. That is, $x_t$ is a prior miss at level $i$ if $x_t \notin B_{t-1}(i - 1)$. If $x_t$ is a prior hit at level $i$ then $\Delta_t(i - 1) < i$;
otherwise $\Delta_t(i - 1) = \infty$. In Figure 3, we have marked the prior hits $x_t$ at level 3 by underscoring the symbol at
level 2 in column $t$. In studying this figure one should remember that an underscore on symbol $s_t(i - 1)$ means
that symbol $x_t$ was a hit in an $(i - 1)$-line cache, not that $s_t(i - 1)$ was. The placement of underscores was chosen
to highlight the propagation of a symbol across a sequence of prior hit positions, to be described below. For
example, of the first ten references four are prior hits at level 3 ($x_4$, $x_5$, $x_7$, and $x_8$) because $c$ ($= x_4, x_5, x_7$)
is found in $B_3(2)$, $B_4(2)$ and $B_6(2)$, and symbol $a$ ($x_8$) is found in $B_7(2)$.

Any prior hit at level $i - 1$ is also a prior hit at level $i$ (for example, $x_5$ in Figure 3). Under LRU, the
other prior hits at level $i$ are the references $x_t$ satisfying $x_t = s_{t-1}(i - 1)$; i.e., the references that hit at the
last level of the size $i - 1$ cache (for example, $x_4$ in Figure 3).

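The key to the simulation method is that under LRU, for all $t \ge 1$ and $i > 1$,

$$s_t(i) = \begin{cases} s_{t-1}(i - 1) & \text{if } x_t \text{ is a prior miss at level } i \\ s_{t-1}(i) & \text{otherwise} \end{cases} \qquad (2)$$

where

$$s_t(1) = x_t, \quad s_0(i) = \emptyset.$$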
Figure 3: LRU acting on an 18 line trace, assuming a cache size $C = 3$. Each prior hit $x_t$ at level 3 is
identified by underscoring $s_t(2)$.

To see this, suppose $x_t \notin B_{t-1}(i - 1)$. The LRU rule puts $x_t$ into level one, and shifts lines $1, 2, \ldots, i - 1$
down one level, which pushes $s_{t-1}(i - 1)$ to level $i$. On the other hand, if $x_t$ is a prior hit at level $i$, the cache
update leaves level $i$ unchanged. In Figure 3, we see that the 3rd and the 6th references are prior misses at
level 3, and that the intervening references are prior hits. As a result, $s_2(2) = a$ enters level 3 at $t = 3$ and
propagates over prior hits at level 3 until $t = 6$, where it is replaced with $s_5(2) = b$, which in turn propagates
up through $t = 8$.

We now describe the simulation algorithm, taking special care with the details because similar methods
are needed in Section 5. At the $(i - 1)$-st iteration we will overwrite vectors $s = (s_0, s_1, \ldots, s_N)$
and $d = (d_1, d_2, \ldots, d_N)$ with the level $i$ cache contents and stack distances, $(s_0(i), s_1(i), \ldots, s_N(i))$ and
$(\Delta_1(i), \Delta_2(i), \ldots, \Delta_N(i))$, respectively. A vector $x = (x_1, x_2, \ldots, x_N)$ holds the trace, and
another vector $u = (u_1, \ldots, u_N)$ will hold a copy of $(s_0(i - 1), s_1(i - 1), \ldots, s_{N-1}(i - 1))$. To initialize the
computation, for $t = 1, \ldots, N$, set $d_t = \infty$, and $s_t = x_t$, $s_0 = \emptyset$. For $i = 1, \ldots, C$, do as follows.

1.  [Update the Level $i$ Cache Contents via equation (2).] For $t = 1, \ldots, N$, set $u_t = s_{t-1}$. For $t = 1, \ldots,
N$, if $d_t \ne \infty$ then set $s_t = s_{t-1}$; otherwise, $s_t = u_t$. This is to be understood, but not implemented,
as a serial update: first $s_1$ is updated, then $s_2$, and so forth.

2.  [Update the Stack Distances.] For $t = 1, \ldots, N$, set $d_t = i$ if $d_t = \infty$ and $u_t = x_t$; otherwise leave $d_t$
unchanged.

The right shift of $s$ into $u$ in step 1 and the update to the stack distances in step 2 are naturally parallel
operations. The update of $s$ is a segmented copy-scan, with the coordinates $t$ with $d_t = \infty$ marking the
segment boundaries. Hence, the cost of both steps is just $O(\log N)$ time using $N/\log N$ PEs. As there are
a total of $C$ iterations to perform, we obtain:

Theorem 2 On the EREW model, given the trace $x_1, \ldots, x_N$, the associated level $C$ stack distances $\Delta_1(C),
\ldots, \Delta_N(C)$ under the LRU replacement policy can be computed in $O(C \log N)$ time using $N/\log N$
processors.
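
The level-by-level scheme can be emulated serially as in the sketch below, which follows the recurrence $s_t(i) = s_{t-1}(i-1)$ for prior misses and $s_t(i) = s_{t-1}(i)$ otherwise. The indexing is an assumption of the sketch: at iteration $i$ it first compares the shifted level $i$ contents $u$ with the trace to record hits at exactly level $i$, and only then advances the contents to level $i + 1$ with a segmented copy-scan whose segment heads are the positions with $d_t = \infty$. The function names are hypothetical, and the scan is written as a plain loop standing in for the parallel primitive.

```python
import math

EMPTY = None  # stands for the empty line marker


def copy_scan(values, is_head):
    """Segmented copy-scan: each position receives its segment head's value."""
    out, current = [], EMPTY
    for value, head in zip(values, is_head):
        if head:
            current = value
        out.append(current)
    return out


def lru_level_by_level(trace, C):
    """Return the level-C stack distances (math.inf for a miss in a C-line set)."""
    n = len(trace)
    d = [math.inf] * n
    s = [EMPTY] + list(trace)            # s[t] = s_t(1) for t >= 1, s[0] = s_0
    for i in range(1, C + 1):
        u = s[:-1]                       # u[t-1] = s_{t-1}(i) for t = 1..n
        # References that hit at exactly level i find themselves in u.
        for t in range(n):
            if d[t] == math.inf and u[t] == trace[t]:
                d[t] = i
        # Advance to level i+1: positions still at infinity (prior misses at
        # level i+1) take the shifted value; the others copy their left neighbour.
        heads = [True] + [d[t] == math.inf for t in range(n)]
        s = copy_scan([EMPTY] + u, heads)
    return d
```

On the hypothetical trace `"acbccacadb"` with `C = 3` this gives `[inf, inf, inf, 2, 1, 3, 2, 2, inf, inf]`; raising `C` to 4 turns the last entry into 4. Each iteration consists of one shift, one elementwise comparison, and one segmented copy-scan, matching the per-level cost counted above.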

We implemented this algorithm on a MasPar MP-1 computer [4], with 16384 PEs. Each PE is a 4-bit
processor with a clock cycle of 80 nanoseconds. A typical integer operation such as those common in our
algorithm requires a few tens of clocks.

Table 1: Execution time of the LRU algorithm on a MasPar MP-1 with $2^{14}$ PEs, in milliseconds, as a function
of trace length and set size.

Our implementation supports "super-saturation" of the PEs, as described earlier. The PE memory size
permits us to assign as many as 2048 references to each PE, thereby permitting the simultaneous simulation
of a trace with over 33 million references. The performance data we present includes only the time spent in
the solution phase of the algorithm. The traces were generated randomly. For a given trace length and cache
set size we observed a 10-15% increase in running time between caches with a very low hit ratio, and caches
with a high hit ratio. This is likely due to the fact that long segments accompany high hit ratios, requiring
greater inter-PE communication to implement the copy-scan. The timings presented are from traces with
nearly perfect hit ratios, and so represent an upper bound on the timing one might expect from an actual
trace.

A  full  implementation  would  have  to  spend  time  loading  the  trace;  the  I/O  time  required  depends  on  the 
available I/O hardware and the organization of the trace on the I/O devices. In light of our timings, it is clear
that  moving  the  trace  onto  the  machine  may  well  be  the  most  serious  bottleneck  an  actual  implementation 
would  face. 

Our  experiments  vary  the  length  of  the  trace  from  2^^  to  2**,  and  the  set  size  C  from  2  to  32.  The 
presented  timings  are  averages,  given  in  milliseconds,  taken  by  executing  the  solution  loop  many  times  in 
succession. 

Observe that about five seconds of execution time were required to analyze the behavior of a 32-line set
on a trace with $2^{25}$ = 33,554,432 references. This is 710 times faster than the solution time (with trace
generation costs subtracted off) of an optimized serial algorithm we implemented on a Sparc-1+ workstation.
These  timings  demonstrate  the  remarkable  promise  of  massive  parallelism  for  trace-driven  cache  simulation. 

5  Reference  Based  Replacement  Policies 

We now broaden the scope of our methods to handle a large class of line replacement policies, which we
term reference-based. In Section 5.3, we extend the class to handle policies that allow priorities associated
with cached lines to “age” so that stale lines tend to be flushed.

Mattson et al. [16] show that a stack policy is obtained if a numerical priority $P(s_t(i))$ is assigned to each
line $s_t(i)$ at reference $x_t$, and the line $y_t(C)$ chosen for replacement on loading $x_t$ is the one with least priority
among the members of $B_{t-1}(C)$. In the class of policies we now consider, a line’s priority is established at
the point it appears in the reference stream, after which it remains constant until the line is referenced again.
Of course we must be able to calculate the priorities from the reference trace. Thus we limit attention to
policies that support efficient parallel priority calculations. As practical policies seem to use very simple
priority assignments, this limitation is mild. Here, “efficient” means within the resource bounds needed for
the rest of our simulation method: $\tilde{O}(\log N)$ time using $N$ processors. Recall (Section 2.2) that $\tilde{O}(\log N)$
means $O(\log N)$ with high probability.

Let  us  define  the  class  of  reference-based  replacement  policies  as  those  stack  policies  induced  by  priorities 
satisfying  the  following  conditions. 

R1: All $P(x_t)$ values can be computed quickly in parallel: in $\tilde{O}(\log N)$ time using $N$ PEs. For example,
the priorities for LFU (Least Frequently Used) can be established with a sort on the reference tags,
followed by a segmented sum-scan.

R2:  A  line’s  replacement  priority  does  not  change  except  when  the  line  appears  in  the  reference  stream. 

Several  important  replacement  policies  are  reference-based,  including 

•  LRU: $P(x_t) = t$.

•  LFU: $P(x_t) = \mathrm{Count}(x_t, t)$, the number of references $x_u = x_t$ for $u < t$. Ties can be broken, for
example, by lexicographic ordering of the lines, or by giving higher priority to the line that has been in
the cache the shortest length of time. ($P(x_t) = \mathrm{Count}(x_t, t) - 1/(t+1)$ would serve the latter purpose.)

•  OPT: $P(x_t)$ is the negation of the smallest index $u > t$ such that $x_u = x_t$.

In  addition,  the  Random  replacement  (RR)  policy  shares  most  of  the  properties  we  need  to  quickly  simulate 
reference-based  policies,  and  we  include  it  in  this  class  as  a  special  case.  Under  RR,  priorities  are  chosen 
that determine a uniform random ranking of the cache contents; details are given below.
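
As an illustration of condition R1, the following Python sketch computes the LFU priorities Count(x_t, t) with a stable sort on the reference tags followed by a segmented exclusive sum-scan. It is a serial emulation of the two parallel primitives, not the MasPar implementation; the function name `lfu_priorities` is ours, and tie-breaking is left out.

```python
def lfu_priorities(trace):
    """Return Count(x_t, t), the number of earlier references to x_t, for each t.

    Serial emulation of: (1) a stable sort on tags, (2) a segmented
    exclusive sum-scan of ones within each run of equal tags.
    """
    # Stable sort of reference indices by tag; equal tags stay in trace order.
    order = sorted(range(len(trace)), key=lambda t: trace[t])
    counts = [0] * len(trace)
    running = 0
    previous_tag = object()          # sentinel that equals no real tag
    for t in order:
        if trace[t] != previous_tag: # segment boundary: restart the scan
            running = 0
            previous_tag = trace[t]
        counts[t] = running          # exclusive sum: earlier references only
        running += 1
    return counts
```

For the trace a, b, a, c, a, b this yields the counts 0, 0, 1, 0, 2, 1, which are the subscripts one would attach to the lines in the style of Figure 4.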

Figure 4 gives an example of the operation of LFU with ties broken by lexicographic ordering, a < b <
c < d. A line’s subscript equals the number of earlier references to the line. First, note that the stack order
and the priority order may differ. A line with low priority can be buried in the middle of the stack order;
for example, line d at t = 13. There are important departures from the behavior of the LRU policy. Under
LRU, the replaced line is the one at the lowest stack level. Here, we see that if the cache size is 2 then at
t = 9, line d misses and replaces line a at the first stack level, leaving line c in place at the second stack level.
To illustrate the entry and propagation of lines across a given level, each prior miss $x_t$ at level 3 is marked
by underscoring $s_t(2)$. As in LRU, a line propagates across all prior hits. Unlike LRU, a line may propagate
across some prior misses. For example, line a enters level 3 at t = 9 and propagates until t = 15, across the
prior misses at t = 10 and t = 12.

Figure 4: LFU; ties are broken by a < b < c < d. The subscripts count the number of earlier references, and
underscores mark prior hits at level 3. Also shown are the stack distances $\Delta_t(2)$ at level 2 and the lowest
priority lines $y_t(2) \in B_{t-1}(2)$ at level 2.


By convention, the priority of $\emptyset$, the empty line marker, is $-\infty$. As before, we assume $s_0(i) = \emptyset$ for
$i = 1, \ldots, C$. It follows from the analysis of [16] that a stack policy is induced by the rule stating that the
line of least priority is selected for replacement. We refer to line $y_t(C)$ as a replacee, and define $y_t(C)$ to be
the least priority line in $B_{t-1}(C)$ if $B_{t-1}(C)$ is full; i.e., $|B_{t-1}(C)| = C$. If $B_{t-1}(C)$ is not full then we let
$y_t(C) = \emptyset$. In studying our notation it is important to remember that $y_t(C)$ refers to a line with a particular
property in the cache after reference $x_{t-1}$, not after $x_t$. This convention follows Mattson et al. [16].

A simple recurrence determines the level $i$ cache contents. Given two cached lines $\alpha$ and $\beta$, let $\max_P\{\alpha, \beta\}$
select the one with higher priority. For all $t > 0$, $s_t(1) = x_t$. For all $i > 1$ and $t > 0$,

$$
s_t(i) =
\begin{cases}
s_{t-1}(i) & \text{if } x_t \text{ is a prior hit at level } i, \\
y_t(i-1) & \text{if } x_t \text{ is a prior miss at level } i \text{ and } s_{t-1}(i) = x_t, \\
\max_P\{y_t(i-1),\, s_{t-1}(i)\} & \text{otherwise.}
\end{cases}
\qquad (3)
$$

Notice that the only lines that ever enter level $i$ are the least priority lines $y_t(i-1)$ from lower levels.
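
To make recurrence (3) concrete, here is a minimal serial sketch in Python that maintains a C-level stack for an arbitrary priority assignment and reports the stack distances. It is a reference simulation for checking small cases, not the parallel method of this paper. The names `stack_distances`, `EMPTY`, and the dictionary `prio` are ours; priorities are assumed distinct, and `None` is assumed never to occur as a line tag.

```python
import math

EMPTY = None  # empty-line marker; by convention its priority is -infinity


def stack_distances(trace, priority, C):
    """Serially apply recurrence (3); return the level-C stack distances.

    trace[t-1] is x_t and priority[t-1] is P(x_t). A distance larger than C
    is reported as math.inf.
    """
    stack = [EMPTY] * C              # stack[i-1] holds s_t(i)
    prio = {EMPTY: -math.inf}        # priority each line received at its last reference
    distances = []
    for t, x in enumerate(trace, start=1):
        delta = stack.index(x) + 1 if x in stack else math.inf
        distances.append(delta)
        y = stack[0]                 # y_t(1) = s_{t-1}(1)
        stack[0] = x                 # s_t(1) = x_t
        for i in range(2, C + 1):
            if i > delta:            # levels below the hit position are untouched
                break
            old = stack[i - 1]
            if i == delta:           # x_t left this level; the carried replacee fills it
                stack[i - 1] = y
                break
            if prio[y] > prio[old]:  # higher priority stays, lower priority is carried down
                stack[i - 1], y = y, old
        prio[x] = priority[t - 1]    # the new priority takes effect from time t on
    return distances
```

With LRU priorities P(x_t) = t, the trace a, b, a, c, b and C = 3 give the distances ∞, ∞, 2, ∞, 3, which agree with the usual count of distinct lines touched since the previous reference.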


5.1  Parallel  Simulation  Level  by  Level 

In this section, we present a rapid parallel simulation algorithm for any reference-based replacement policy.
For any given $C > 0$, the objective is to compute the level $C$ stack distances $\Delta_1(C), \ldots, \Delta_N(C)$. The
algorithm works level by level, like the LRU simulation algorithm presented in Section 4. Specifically, we
compute the cache contents $s_t(i)$, the replacees $y_t(i)$ and the stack distances $\Delta_t(i)$ at level $i$, given these
same quantities at level $i - 1$. As in the LRU simulation, just $O(1)$ space per reference is needed, with the
results of the level $i$ computation overwriting those for level $i - 1$.

Level 1 is easy: $s_t(1) = x_t$, $y_t(1) = s_{t-1}(1)$, $\Delta_t(1) = 1$ if $s_t(1) = s_{t-1}(1)$, and $\Delta_t(1) = \infty$ otherwise. Two
facts, which follow from equation (3), are crucial to our approach for computing the desired results for level
$i > 1$:


•  If a new line, say $\alpha$, enters level $i > 1$ at time $t$ (meaning $s_t(i) = \alpha$, $s_{t-1}(i) \neq \alpha$) then $x_t$ must be a
prior miss at level $i$ and $\alpha$ must be the replacee $y_t(i-1)$.

•  Assuming $\alpha = y_t(i-1)$ does enter level $i > 1$, it propagates until coming to the first reference
$x_u$, $u > t$, where either $x_u = \alpha$ or $x_u$ is a prior miss and $P(y_t(i-1)) < P(y_u(i-1))$. That is,
$s_t(i) = \cdots = s_{u-1}(i) = \alpha$. If $u < N + 1$ then replacee $y_u(i-1)$ enters level $i$ at time $u$. If there is no
such $u < N + 1$ then the replacee $y_t(i-1)$ propagates on through time $N$. For every reference $x_t$ let
$u(t)$ denote the index so identified.

Consider the graph where the vertices are the indices $t$ of all prior misses $x_t$ and there is an edge from
$t$ to $u(t)$. This edge records the fact that if $y_t(i-1)$ enters level $i$ then $s_t(i) = \cdots = s_{u(t)-1}(i) = y_t(i-1)$,
whereupon it is replaced by $y_{u(t)}(i-1)$ (if $u(t) < N + 1$). Observe that not all such $y_t(i-1)$ actually do enter
level $i$; for instance, in Figure 4, $y_{10}(2) = d$ does not enter level 3, because $P(y_{10}(2)) < P(s_9(3))$. However,
the set of references that do actually enter level $i$ can be determined by following the maximal path through
the graph, starting at vertex 1. Replacee $y_1(i-1) = \emptyset$ enters at time 1 and propagates until time $v = u(1) - 1$.
If $v < N$ then replacee $y_{v+1}(i-1)$ enters level $i$, and propagates until time $w = u(v+1) - 1$. If $w < N$
then replacee $y_{w+1}(i-1)$ enters level $i$, and so forth.

Converting  this  serial  process  for  simulating  level  i  into  a  parallel  one,  we  simulate  level  i  >  1  as  follows: 

1.  [Compute tentative propagation intervals.] For each prior miss $x_t$, compute

    stop(t) = least $s > t$ such that $P(x_s) > P(x_t)$ and $x_s$ is a prior miss at level $i$    (4)

    next(t) = least $s > t$ such that $x_s = x_t$,

where, by convention, stop(t) and next(t) equal $N + 1$ if the index $s$ above does not exist. Set
$u(t) = \min\{\mathrm{next}(t), \mathrm{stop}(t)\}$. By earlier remarks, if $y_t(i-1)$ enters level $i$ then
$s_t(i) = \cdots = s_{u(t)-1}(i) = y_t(i-1)$.

2.  [Follow propagation chain.] The pointers $u(t)$ determine the replacees that enter level $i$, namely,
those with indices 0, $u(0)$, $u(u(0))$, ..., with the sequence stopping at $v = u(\ldots u(0) \ldots) < N + 1$,
$u(v) = N + 1$. Mark these replacees.

3.  [Compute level $i$ results.] For each marked replacee $y_t(i-1)$ and each $v$ in the interval $[t, u(t) - 1]$, set
$s_v(i) = y_t(i-1)$. For every $x_t$ set $y_{t+1}(i)$ to $\min_P\{s_t(i), y_{t+1}(i-1)\}$, the lower-priority of the two lines
(since $y_{t+1}(i)$ is the least priority line in $B_t(i)$). Following this, if $x_t$ is a prior
miss at level $i - 1$ and if $s_{t-1}(i) = x_t$ set $\Delta_t(i) = i$. Set $\Delta_t(i) = \Delta_t(i-1)$ for prior hits.
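
Step 2 can be illustrated with a small pointer-jumping sketch. The Python below emulates synchronous rounds serially: in each round every marked index marks the index its jump pointer reaches, and then every pointer is doubled. It is an illustration under our own assumptions (the name `mark_chain`, 0-based storage of the pointers, and the sentinel value `len(u)` standing in for N + 1); it does not reproduce the processor-efficient EREW scheduling of [14].

```python
def mark_chain(u):
    """Mark the indices on the chain 0, u[0], u[u[0]], ... by pointer jumping.

    u[t] > t is the successor of index t; the value len(u) is the sentinel
    (the analogue of N + 1). Returns a list of booleans, one per index.
    """
    n = len(u)
    jump = list(u) + [n]             # the sentinel points to itself
    marked = [False] * (n + 1)
    marked[0] = True
    rounds = 0
    while (1 << rounds) < n + 1:     # about log2(n) synchronous rounds
        new_marked = marked[:]
        for v in range(n + 1):       # conceptually one PE per index
            if marked[v]:
                new_marked[jump[v]] = True
        jump = [jump[jump[v]] for v in range(n + 1)]   # double every pointer
        marked = new_marked
        rounds += 1
    return marked[:n]
```

After k rounds the marked indices are exactly those within distance 2^k - 1 of index 0 along the chain, so a logarithmic number of rounds suffices.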

Computing the stop(t) values is an instance of the closest larger right neighbor (CLRN) problem, discussed
in Section 2.2. It can be solved in $\tilde{O}(\log N)$ time using $N$ PEs. A simpler $O(\log^2 N)$ time solution is described
below. Computing the next(t) values via sorting is described in Section 3; the time needed is $O(\log N)$ using
$N$ PEs. Marking the replacees on the chain of pointers from 0 to $N + 1$ is a pointer jumping problem [14],
which can be solved in $O(\log N)$ time using $N/\log N$ PEs. The final step of updating the level $i$ cache
contents and stack distances is essentially the same as was done for LRU simulation in Section 4. We do not
repeat the details. The time needed is $O(\log N)$ using $N/\log N$ PEs. Summing up, the total time is
$\tilde{O}(\log N)$ using $N$ PEs. To produce the level $C$ results, the computation must be repeated for each
$i = 2, \ldots, C$. Thus,

Theorem 3 On the EREW model, given the trace $x_1, \ldots, x_N$, the associated level $C$ stack distances
$\Delta_1(C), \ldots, \Delta_N(C)$ induced by any reference-based replacement policy can be computed in time
$\tilde{O}(C \log N)$ using $N$ PEs.

Figure 5: Search tree identifying maximum values over subintervals, which is used to solve the nearest right
neighbor problem.

We  close  this  section  with  a  simple  method  for  solving  the  CLRN  problem.  This  method  plays  the  key 
role  in  generalizing  the  simulation  method  to  accommodate  priority  aging  (Section  5.3). 

Let $a_1, \ldots, a_N$ be a sequence of $N$ numbers, and, for simplicity, assume that $N$ is a power of two and
$a_N = +\infty$. For each $a_i$, $i = 1, \ldots, N - 1$, we wish to find the closest right larger neighbor $a_j$; i.e., $j > i$ is
as small as possible and $a_i < a_j$. Construct a binary tree over the inputs as illustrated in Figure 5. Each
node is labeled with the maximum value of the inputs in its subtree and with the corresponding subrange
of indices. To find the closest larger right neighbor $a_j$ of $a_i$ a two phase search is initiated. Phase one starts
at the leaf node $a_i$, and progresses in steps up the tree. At each step we move from the present node to the
nearest internal node at the next higher level whose span includes a node to the right. This phase ends upon
visiting (i) an internal node which is rightmost at its level, or (ii) a node whose value is (strictly) greater
than $a_i$. In the example of Figure 5, the first phase for $a_3$ visits nodes representing ranges [3,4] and [5,8]. In
general, the first phase stops at a node spanning an interval $[m + 2^k, m + 2^{k+1}]$, with $i < m + 2^k$, which must
contain the sought $a_j$. In phase two the search descends down to $a_j$, at each step moving to the left child if
the left child’s value exceeds $a_i$, or to the right child otherwise. On a concurrent read model, carrying out
all $N$ searches in parallel gives an $O(\log N)$ time solution. On an exclusive read model, standard methods
[14] for resolving the read conflicts add an $O(\log N)$ factor, bringing the total time to $O(\log^2 N)$.
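
The two phase search can be sketched in Python as follows. The searches are run one after another here, whereas the text runs all of them in parallel; the heap-style array layout, the padding with +∞ sentinels, 0-based indices, and the name `clrn` are our own choices for illustration. Inputs are assumed finite.

```python
import math


def clrn(a):
    """Closest larger right neighbor by the two phase max-tree search.

    Returns, for each 0-based i, the smallest j > i with a[j] > a[i],
    or len(a) if no such j exists.
    """
    n = len(a)
    m = 1
    while m <= n:                    # power of two with room for a +inf sentinel
        m *= 2
    # Heap layout: root at 1, children of k at 2k and 2k+1, leaves at m..2m-1.
    tree = [None] * m + list(a) + [math.inf] * (m - n)
    for k in range(m - 1, 0, -1):
        tree[k] = max(tree[2 * k], tree[2 * k + 1])

    result = []
    for i in range(n):
        node = m + i
        # Phase one: climb one level per step, always to the node whose span
        # reaches further right (the parent, or the parent's right neighbor).
        while True:
            node = node // 2 if node % 2 == 0 else node // 2 + 1
            if tree[node] > a[i]:
                break                # the rightmost nodes hold +inf, so this is reached
        # Phase two: descend to the leftmost leaf whose value exceeds a[i].
        while node < m:
            node = 2 * node if tree[2 * node] > a[i] else 2 * node + 1
        result.append(min(node - m, n))
    return result
```

For example, `clrn([5, 3, 6, 2, 7])` returns `[2, 2, 4, 4, 5]`, where the final 5 plays the role of "no larger neighbor".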
5.2 Random Replacement

The Random Replacement (RR) rule selects the line to replace on a cache miss independently and uniformly
at random from the set of cached lines. Mattson et al. [16] observed that the selections can be made so as
to preserve the stack property ($B_t(i-1) \subset B_t(i)$, for all $t$ and all $i > 1$), by coupling the random decisions as follows.
Suppose the members of $B_t(i-1)$ have been ranked randomly from 1 to $i - 1$, with the understanding
that the higher a line’s rank the lower its priority. To build $B_t(i)$, insert $s_t(i)$ into the priority structure by
randomly choosing an integer rank for it from 1 to $i$, say $k$. Members of $B_t(i-1)$ with ranks $\geq k$ have their
ranks incremented by one in $B_t(i)$. Other members of $B_t(i-1)$ retain their rank. Thus, for each prior
miss $x_t$ at level $i$, we may decide the line $y_t(i)$ of least priority (highest rank) in $B_t(i)$ by a coin toss: with probability $1/i$,
line $y_t(i) = s_{t-1}(i)$, and with the complementary probability $y_t(i) = y_t(i-1)$. Moreover, the outcome is
completely independent of $s_{t-1}(i)$.

This independence can be exploited to considerably simplify the simulation of level $i$ over that described
above. Step 1 becomes: For each prior miss $x_t$ compute next(t) as before, and use a coin toss to decide
whether to label index $t$ as a “stopper” (probability $1/i$) or leave the index unlabeled (probability $1 - 1/i$).
Let stop(t) be the least stopper $u > t$, or $N + 1$ if no such $u$ exists. Let $u(t) = \min\{\mathrm{stop}(t), \mathrm{next}(t)\}$ as
before. Steps 2 and 3 remain the same.
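
The modified Step 1 is simple enough to sketch directly. In the Python below, each prior miss is labeled a stopper with probability 1/i, and a right-to-left scan then delivers to every index the nearest stopper to its right; the scan is a serial stand-in for the parallel copy-scan. The names `rr_stops` and `nearest_stopper_to_right`, and the use of Python's `random.Random` as the source of coin tosses, are our assumptions.

```python
import random


def nearest_stopper_to_right(stopper):
    """stop[t-1] = least stopper u > t (1-based), or N + 1 if none exists."""
    n = len(stopper)
    stop = [n + 1] * n
    nearest = n + 1
    for t in range(n, 0, -1):        # right-to-left scan, emulated serially
        stop[t - 1] = nearest
        if stopper[t - 1]:
            nearest = t
    return stop


def rr_stops(is_prior_miss, level, rng=None):
    """Coin-toss stoppers for one level of RR simulation.

    is_prior_miss[t-1] says whether x_t is a prior miss at this level.
    Returns (stopper flags, stop array).
    """
    rng = rng or random.Random()
    # Each prior miss becomes a stopper independently with probability 1/level.
    stopper = [bool(miss) and rng.random() < 1.0 / level for miss in is_prior_miss]
    return stopper, nearest_stopper_to_right(stopper)
```

Passing a seeded generator, for example `random.Random(0)`, makes a run repeatable, which is convenient when comparing against a serial RR simulation.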

Computing the stop(t) values entails $N$ independent coin tosses and a segmented copy-scan (cf. Section
2.2), operations that need $O(\log N)$ time using $N$ PEs. As a result, we obtain

Theorem 4 On the EREW model, given the trace $x_1, \ldots, x_N$, the associated level $C$ stack distances
$\Delta_1(C), \ldots, \Delta_N(C)$ induced by the RR policy can be computed in time $O(C \log N)$ using $N$ PEs.

Comparing  with  Theorem  3  we  see  that  dropping  the  CLRN  problem  and  the  probabilistic  algorithm 
used  to  solve  it  strengthens  the  running  time  bound  from  one  that  holds  with  high  probability  to  one  that 
holds  deterministically. 

5.3  Priority  Aging 

Under  any  reference-based  replacement  rule  other  than  RR,  a  line’s  priority  is  fixed  when  it  enters  the 
cache.  If  the  policy  is  a  practical  one,  then  it  is  likely  that  the  priority  is  a  simple  function  of  the  previous 
references.  For  example,  under  LFU  a  line’s  priority  is  the  number  of  earlier  references  to  the  line.  Past 
cache activity gives an imperfect indication of future cache activity. Under LFU a flurry of references to
a small set of lines might lead to their long retention during a subsequent period when the lines are not
needed. To counter this, it is natural to consider policies that allow a line’s priority to age; i.e., to decrease
monotonically while the line remains unreferenced. In this section, we extend our reference-based simulation
method  to  accommodate  aging. 

Let $\phi : R \to R$ be a monotonically decreasing operator, and let $\phi^d$ represent the $d$-fold application of
$\phi$. Let $P_t(\alpha)$ denote the priority of a line $\alpha$ held in the cache at the time of $x_t$. We consider replacement
policies where the initial priority of line $x_t$ is reference-based ($P_t(x_t)$ satisfies R1 of Section 5), but the line’s
priority “ages” to $P_{t+d}(x_t) = \phi^d(P_t(x_t))$ in the cache if it remains unreferenced throughout time
$t + 1, \ldots, t + d$. As before, the replacement policy always selects the line with least priority. Some natural
aging operators $\phi$ are $\phi(x) = x - a$ for some fixed $a > 0$ or $\phi(x) = ax$ for some fixed $a \in (0, 1)$. We assume
that for any $d = 1, \ldots, N - 1$, $\phi^d(x)$ can be computed in $O(1)$ time.
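
For the two aging operators just mentioned, the d-fold application has an obvious closed form, as the following Python sketch shows. The function names are ours, and treating exponentiation as a unit-cost operation is an assumption about the machine model.

```python
def age_subtract(x, d, a):
    """d-fold application of phi(x) = x - a, for a fixed a > 0."""
    return x - d * a


def age_scale(x, d, a):
    """d-fold application of phi(x) = a * x, for a fixed 0 < a < 1.

    Assumes a ** d counts as a single operation.
    """
    return (a ** d) * x
```

Both functions agree with applying the operator d times in a loop, up to floating-point rounding in the multiplicative case.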

Equation (3) describing the evolution of the stack levels continues to hold. A little thought shows that
to adapt the simulation method to accommodate aging, we need only change the definition of stop(t) in
equation (4) to

    stop(t) = least $s > t$ such that $P_s(x_s) > \phi^{s-t}(P_t(x_t))$, and $x_s$ is a prior miss at level $i$.

Computing these new values stop(t) can be posed as the following variant of the CLRN problem. Let $a_1, \ldots, a_N$
be a sequence of $N$ numbers, and, for simplicity, assume that $N$ is a power of two and $a_N = +\infty$. For
each $i = 1, \ldots, N - 1$, we wish to find the smallest $j > i$ such that

    $\phi^{j-i}(a_i) < a_j$.


We now sketch how to extend the binary search solution given in Section 5.1 to solve the new problem.
Let $\psi$ denote the inverse of $\phi$. Since $\psi$ is monotone increasing, the inequality above is equivalent to

    $\phi^u(a_i) < \psi^v(a_j)$ for all nonnegative integers $u, v$ with $u + v = j - i$.

Letting $[u, v]$ be any range of indices with $u > i$,

    $\phi^{j-i}(a_i) < a_j$ for some $j \in [u, v]$ if and only if $\phi^{u-i}(a_i) < \max\{a_u, \psi(a_{u+1}), \ldots, \psi^{v-u}(a_v)\}$.

This equivalence reveals a way to determine whether $a_i$ ages below some $a_j$, $j \in [u, v]$, and $i < u$; i.e.,
whether $\phi^{j-i}(a_i) < a_j$. For $j \geq u$ define $\psi^{j-u}(a_j)$ to be the “rejuvenated” value of $a_j$ with respect to index
$u$ (i.e., $a_j$’s priority if “de-aged” back to position $u$). Let $r_{u,v} = \max\{a_u, \psi(a_{u+1}), \ldots, \psi^{v-u}(a_v)\}$ denote the
maximum (over $j \in [u, v]$) rejuvenated value of any $a_j$ with respect to $u$. To find out if $a_i$ ages below some
$a_j$ with $j \in [u, v]$, we may simply compare $\phi^{u-i}(a_i)$ and $r_{u,v}$. Since $a_i$ is arbitrary in this discussion, it is
possible to use $r_{u,v}$ concurrently in many searches.

A “rejuvenation-max” tree can be built in parallel using the following observation: for any $M$ (assumed
to be a power of 2, $M \geq 2$),

$$
r_{u,\,u+M-1} = \max\{\, r_{u,\,u+M/2-1},\ \psi^{M/2}(r_{u+M/2,\,u+M-1}) \,\}.
$$

This recursion shows we can build a rejuvenation-max tree over $a_1, \ldots, a_N$ in $\log N$ steps. At every step,
all nodes at a given level of the tree are constructed. Leaves are understood to be at level $\log N$, the root
is at level 0. A node spanning an interval $[u, v]$ is labeled with $r_{u,v}$. The recursion shows that to compute
the label of a node at level $k$, one computes the maximum of (i) the label on the node’s left-child, and (ii)
the label on the node’s right-child promoted by an operator $\psi^c$ with $c = N/2^{k+1}$, the number of leaves under
the left child. Thus, the parent of a left
node with value $a$ and right node with value $b$ has label $\max\{a, \psi^c(b)\}$. Figure 6 illustrates how the tree of
Figure 5 is modified to accommodate the aging operator $\phi(x) = x/2$.

Figure 6: Rejuvenation-max tree, which is used to solve the nearest right neighbor problem when aging is
permitted. In this example, the aging operator is $\phi(x) = x/2$.

Given $a_i$ we wish to find the closest $a_j$ to the right such that $\phi^{j-i}(a_i) < a_j$. The same two phase strategy
described in Section 5.1 works, replacing the comparison of the value $a_i$ with the label of a node spanning an
interval $[u, v]$ with the comparison of $\phi^{u-i}(a_i)$ with $r_{u,v}$, or an equivalent comparison. For example, consider
$a_3$’s search for the setup of Figure 5. In the first phase, the search moves up and to the right in the tree,
looking over successively larger intervals for a value that $a_3$ ages below. First, we compare $a_3 = 6$ with
$r_{3,4} = 6$, and since $a_3$ is not smaller continue the first phase. The next node visited represents [5,8]. We
compare $\phi^2(a_3) = 1.5$ with $r_{5,8} = 32$, and as $\phi^2(a_3)$ is smaller, phase one stops at this node. In the second phase,
the search moves down from [5,8] to locate the leftmost $a_j$ in [5,8] that $a_3$ ages below. First, we branch
left to [5,6], because $r_{5,6} = 14$ is larger than $\phi^2(a_3) = 1.5$. Second, we branch left again to [5,5] because
$r_{5,5} = 3 > \phi^2(a_3) = 1.5$. Since [5,5] is a leaf, the search stops, having located the right match, $a_5$, for $a_3$.
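
The rejuvenation-max tree and its two phase search can be sketched in Python as below. This is an illustration under our own assumptions, not the paper's implementation: the searches run serially, indices are 0-based, the input is padded with +∞ sentinels, and the caller supplies `phi_pow(x, d)` and `psi_pow(x, d)` for the d-fold aging operator and its inverse. The name `aged_clrn` is ours.

```python
import math


def aged_clrn(a, phi_pow, psi_pow):
    """Smallest j > i with phi^(j-i)(a[i]) < a[j], or len(a) if none exists.

    phi_pow(x, d) applies the aging operator d times; psi_pow(x, d) applies
    its inverse d times. Inputs are assumed finite.
    """
    n = len(a)
    m = 1
    while m <= n:                    # power of two with room for a +inf sentinel
        m *= 2
    # Heap layout: root at 1, leaves at m..2m-1. r[k] is the node's
    # rejuvenation-max label, start[k] the first leaf index it spans.
    r = [None] * m + list(a) + [math.inf] * (m - n)
    start = [None] * m + list(range(m))
    half, first = 1, m // 2          # half = number of leaves under each child
    while first >= 1:                # build one tree level per iteration
        for k in range(first, 2 * first):
            r[k] = max(r[2 * k], psi_pow(r[2 * k + 1], half))
            start[k] = start[2 * k]
        half, first = 2 * half, first // 2

    def ages_below(i, node):
        # Compare a[i], aged up to the node's first index, with the node's label.
        return phi_pow(a[i], start[node] - i) < r[node]

    result = []
    for i in range(n):
        node = m + i
        while True:                  # phase one: climb, extending to the right
            node = node // 2 if node % 2 == 0 else node // 2 + 1
            if ages_below(i, node):
                break
        while node < m:              # phase two: descend to the leftmost match
            node = 2 * node if ages_below(i, 2 * node) else 2 * node + 1
        result.append(min(node - m, n))
    return result
```

As a usage example with the halving operator, take `phi_pow = lambda x, d: x / 2.0 ** d` and `psi_pow = lambda x, d: x * 2.0 ** d` on the hypothetical input `[1, 2, 6, 3, 3, 7, 8, 4]`, whose third through eighth entries were chosen by us to agree with the node labels quoted in the text (the first two are arbitrary). The search for the third entry then visits labels 6, 32, 14 and 3 and ends at the fifth entry, as in the worked example above.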

Building the rejuvenation-max tree costs $O(\log N)$ time using $N$ PEs. On the CREW model, we may
assign one processor to the search for each input, and so obtain an $O(\log N)$ time solution using $N$ PEs.
This in turn implies an $O(\log^2 N)$ time solution using $N$ PEs on the EREW model. The final result is:

Theorem 5 On the EREW model, given the trace $x_1, \ldots, x_N$, the level $C$ stack distances $\Delta_1(C), \ldots,
\Delta_N(C)$ induced by any reference-based policy with aging can be computed in time $O(C \log^2 N)$ using $N$ PEs.

6  Summary 

Trace driven cache simulation is an important tool used in the design of computer systems. Parallel
processing offers the promise of reducing the time required to execute a cache simulation, and hence reducing the
overall cache design time. We have shown how massively parallel SIMD architectures can be applied to this
important problem area.

We note that the bottleneck problem in our simulation of reference-based replacement policies is the
closest larger right neighbor problem. An improvement in the solution of this problem to $O(\log N)$ time
(without using probabilistic methods) and $N$ PEs on the EREW model appears possible [7]. This would
improve the reference-based simulation method to deterministic $O(C \log N)$ time using $N$ PEs.

A  number  of  important  issues  remain.  There  is  a  class  of  “clock-based”  stack  algorithms  which  do  not 
appear  to  fit  within  our  framework.  The  classical  clock  algorithm  [6]  associates  one  bit  with  each  physical 
line in the cache. The bit is set whenever a new line is written into the physical location. A clock counter
determines  replacement  lines.  On  a  hit  the  counter  is  untouched,  but  on  a  miss,  the  counter  scans  the  set 
for  a  clear  bit;  the  first  clear  bit  found  identifies  the  replacement  line.  The  scan  begins  where  the  counter 
was  last  left,  and  any  set  bit  encountered  in  the  scan  is  cleared.  Thus,  at  most  one  scan  of  the  set  is  needed 
to  find  a  clear  bit.  The  clock  algorithm  is  induced  if  we  assign  priority  d  to  line  r  if  d  lines  must  be  scanned 
by  the  clock  before  choosing  r  as  the  replacement.  A  line’s  priority  ages  as  it  sits  in  the  cache,  but  in  a 
highly  state-dependent  way.  One  object  of  our  future  research  is  to  determine  whether  clock-based  stack 
algorithms  can  be  simulated  in  parallel. 
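
For reference, the serial clock rule described above can be sketched in a few lines of Python. The description leaves open where the counter rests after a replacement; the sketch assumes it advances past the newly written location. The name `clock_misses` is ours, and `None` is assumed never to occur as a line tag.

```python
def clock_misses(trace, C):
    """Count the misses of one C-line set under the clock rule sketched above."""
    lines = [None] * C       # contents of the physical locations
    bit = [False] * C        # one bit per physical location
    hand = 0                 # the clock counter
    misses = 0
    for x in trace:
        if x in lines:
            continue         # hit: the counter is untouched
        misses += 1
        while bit[hand]:     # clear set bits until a clear bit is found
            bit[hand] = False
            hand = (hand + 1) % C
        lines[hand] = x      # the first clear bit identifies the replacement line
        bit[hand] = True     # the bit is set when a new line is written
        hand = (hand + 1) % C    # assumption: the counter moves past the new line
    return misses
```

Because every set bit met during a scan is cleared, the inner loop visits each location at most once before it finds a clear bit, which matches the remark that at most one scan of the set is needed.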

We  believe  that  the  geometric  methods  used  to  obtain  the  fast,  set  size  independent  LRU  simulation 
method  of  Section  3  might  yield  similar  simulation  methods  for  OPT  and  for  general  reference-based  policies. 
Another  important  issue  is  whether  these  techniques  can  be  extended  to  the  simulation  of  multiprocessor 
caches.  Yet  another  issue  is  the  use  of  SIMD  processors  to  generate  synthetic  cache  traces.  The  method 
discussed  in  [20]  is  basically  LRU  “in  reverse”:  given  the  stack  distances,  compute  the  reference  string.  We 
believe we can implement this method in poly-log time using ideas similar to those developed here. Given
the  promise  of  SIMD  trace-driven  simulation,  a  more  comprehensive  study  of  parallelized  synthetic  trace 
generation  will  be  useful. 